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Abstract 

Background: Many methods have been developed to infer and reason about molecular interaction networks. 
These approaches often yield networks with hundreds or thousands of nodes and up to an order of magnitude 
more edges. It is often desirable to summarize the biological information in such networks. A very common 
approach is to use gene function enrichment analysis for this task. A major drawback of this method is that it 
ignores information about the edges in the network being analyzed, i.e., it treats the network simply as a set of 
genes. In this paper, we introduce a novel method for functional enrichment that explicitly takes network 
interactions into account. 

Results: Our approach naturally generalizes Fisher's exact test, a gene set-based technique. Given a function of 
interest, we compute the subgraph of the network induced by genes annotated to this function. We use the 
sequence of sizes of the connected components of this sub-network to estimate its connectivity. We estimate the 
statistical significance of the connectivity empirically by a permutation test. We present three applications of our 
method: i) determine which functions are enriched in a given network, ii) given a network and an interesting sub- 
network of genes within that network, determine which functions are enriched in the sub-network, and iii) given 
two networks, determine the functions for which the connectivity improves when we merge the second network 
into the first. Through these applications, we show that our approach is a natural alternative to network clustering 
algorithms. 

Conclusions: We presented a novel approach to functional enrichment that takes into account the pairwise 
relationships among genes annotated by a particular function. Each of the three applications discovers highly 
relevant functions. We used our methods to study biological data from three different organisms. Our results 
demonstrate the wide applicability of our methods. Our algorithms are implemented in C++ and are freely 
available under the GNU General Public License at our supplementary website. Additionally, all our input data and 
results are available at http://bioinformatics.cs.vt.edu/~murali/supplements/2011-incob-nbe/. 



Background 

The functioning of a living cell is governed by an intri- 
cate network of interactions among different types of 
molecules. These interactions transduce external signals, 
control gene expression, protein synthesis and localiza- 
tion, chemically modify protein activities, and drive 
metabolic and biochemical reactions. Considerable effort 
in molecular and cellular biology has been expended 
over the last 50 years by individual research groups on 
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testing and detecting interactions on a small scale. The 
results of these experiments are enshrined in the litera- 
ture. In the last few years, a number of efforts have 
manually curated the literature and created databases of 
these interactions [1-3]. More recently, the genomic 
revolution has inspired the development of experimental 
technologies that can detect interaction networks in a 
high-throughput manner and on a genome-wide scale. 
For example, the yeast 2-hybrid screen has been scaled 
up to unveil protein-protein interaction networks con- 
taining tens of thousands of interactions in a number of 
organisms [4,5]. In a similar vein, the chromatin 
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immunoprecipitation on a microarray (ChlP-on-chip) 
technology allows the detection of the targets of a speci- 
fied transcription factor on a genome-wide scale [6]. 
These developments have made molecular interaction 
networks pervasive in systems biology. 

Concomitantly, a number of computational 
approaches have been developed to analyze networks 
and their properties. Foremost among them are methods 
to reverse engineer gene regulatory networks by inte- 
grating gene expression data with other types of 'omic 
data [7]. Such interactions usually relate the expression 
of a gene to that of other genes in the cell [8]. Another 
broad class of methods overlay gene expression data for 
a condition on the wiring diagram to compute the cell's 
response network for that condition [9-11]. Networks of 
these types can contain hundreds or thousands of nodes 
and an order of magnitude more edges. For example, 
the B-cell interactome [12,13], a network of experimen- 
tally verified or computationally predicted protein-DNA, 
protein-protein, and transcription factor-modulator 
interactions, contains nearly 6000 nodes and over 
64,000 edges. It is often desirable to summarize the bio- 
logical information in such networks. A very common 
approach is to perform enrichment analysis of the terms 
in some catalog such as the Gene Ontology [14-16]. 
Enriched functions and processes are very useful in 
summarizing the main biological themes of a large net- 
work at a high level, as a preliminary to more detailed 
mechanistic studies. When applied to reverse-engineered 
or response networks, a major drawback of functional 
enrichment is that it ignores information about the 
edges in the network being analyzed, i.e., it treats the 
network simply as a set of genes. Therefore, a function 
may appear to be enriched in a network, but the genes 
annotated with that function may be highly discon- 
nected within the network. In such cases, it is difficult 
to interpret the relevance of that function to the 
network. 

In this paper, we introduce a novel method for func- 
tional enrichment that explicitly takes network interac- 
tions into account. Our approach naturally generalizes 
Fishers exact test, a widely-used gene set-based techni- 
que. We use the sequence of sizes of the connected 
components of the network to estimate its degree of 
connectivity. We estimate the statistical significance of 
this connectivity empirically by a permutation test. 

It may be argued that one approach that mitigates the 
drawback of using gene set enrichment on networks is 
to find clusters within the network and then compute 
enriched functions within them. Since clusters are 
usually densely-connected, enriched functions are likely 
to induce connected subgraphs within clusters. Our 
approach is distinct from finding enriched functions in 
clusters. Clustering algorithms typically compute dense 



subgraphs. In contrast, we detect subgraphs (defined by 
genes annotated with specific functions) that are more 
connected than may be expected at random. In our 
results, we show examples of subgraphs that clustering 
algorithms may not detect. 

We showcase three applications of our approach: i) 
determine which functions are enriched in a reverse- 
engineered network, ii) given a network and a response 
sub-network of genes within that network, determine 
which functions are enriched in the sub-network, and 
Hi) given two experimentally derived networks, deter- 
mine the functions for which the connectivity improves 
when we merge the second network into the first. These 
analyses use data from three different species (human, 
rat, and bakers yeast). They demonstrate that our con- 
ceptualization of network-based functional enrichment 
is a powerful framework for addressing a diverse variety 
of interesting biological questions about molecular inter- 
action networks. 

Formulation of functional enrichment 

The problem of functional enrichment is usually formu- 
lated as follows. We have a universe U of genes and an 
"interesting collection" C <= U of genes. We desire to 
evaluate the functional coherence of C based on the 
annotations of the genes in it. To this end, we have 
access to a set F of biological functions. For each func- 
tion fe F, let UfQ U denote the set of genes annotated 
by / Furthermore, let Cf = C n Uf denote the subset of 
genes in C that are annotated by / We will use lower- 
case letters to denote the cardinalities of the corre- 
sponding sets, which are named by uppercase letters. 
See Figure 1 (top) for an illustration. The standard for- 
mulation of functional enrichment in terms of the one- 
sided version of Fisher's exact test asks the following 
question: 

If we select a set X of Uf genes uniformly at random 
(without replacement) from the set of all genes U, what 
is the probability that In C will contain Cf or more 
genes? 

Thus, we are interested in the probability that a ran- 
dom set of Uf genes would contain Cf or more genes 
from C. As is well known, we can compute this prob- 
ability by: 
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Now let us consider the setting in which network- 
based functional enrichment is relevant. We are given 
an undirected graph G - (U, I), where U is the set of all 
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Figure 1 Functional enrichment cartoons. (Top) The standard formulation of functional enrichment computes the statistical significance of the 
size of Cf, the overlap between an interesting collection C of genes and the set U f of genes annotated with function f. (Bottom) Our network- 
based approach to functional enrichment computes the statistical significance of the connectivity of/-//- the network induced by the intersection 

of the interesting collection C and the set U f of genes annotated by the function. 
^ J 



genes (as before) and / is a set of edges. We are also 
given a subgraph H = (C, J) of G, where C (the "interest- 
ing collection" of genes) is a subset of U, and / is a sub- 
set of /. Note that H may not be the subgraph of G 
induced by C, i.e., some edges of / that connect node 
pairs in C may be missing from /. Given a function/, 
define Hf to be the subgraph of H induced by the genes 
in Uf n C, i.e., the subgraph of H induced by the genes 
that are annotated by / We desire to use statistics of Hf 
to estimate whether the function / is enriched in H or 



not. This concept is illustrated in Figure 1 (bottom). 
Intuitively, even if Hf is highly disconnected, existing 
methods may declare / to be highly enriched in H. How- 
ever, it is difficult to interpret the biological relevance of 
such a function. Therefore, we would like to incorporate 
the connectedness of Hf into its evaluation. 

Ideally, / is highly statistically significant if Hf contains 
only one connected component, and /is statistically 
insignificant if Hf contains only singletons (or many 
small components). We define the size of a connected 
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component as the number of nodes in that component. 
Suppose Hf has m connected components whose sizes 
are a lf <2 2 >—> where a t > a i+1 , 1 < i <m. Using the 
abbreviation "cs" for "component sizes," we will use cs 
(Hf) to denote this non-increasing sequence of numbers. 
We would like to estimate the statistical significance of/ 
in terms of cs(Hf). If X is a subset of U, we abuse nota- 
tion and use H x to denote the subgraph of H induced 
by X. Drawing a parallel with the formulation of func- 
tional enrichment in the network-free case, we pose the 
following question: 

If we select a set X of Uf genes uniformly at random 
(without replacement) from the set of all genes U, what 
is the probability that cs(H x ) > cs(Hj)l 

Clearly, answering these questions requires that we 
define how we can compare different values of csQ, i.e., 
decide if one sorted sequence of connected component 
sizes is "greater than" another. Note that two distinct 
subsets X and Y of U may induce different subgraphs of 
H (i.e., H x * Hy) that have the same sequence of com- 
ponent sizes. Conversely, H x and H Y may have the same 
number of nodes, yet cs(H x ) and cs(H Y ) may be vastly 
different. (Consider the case where \X n C| = \Y n C|, 
and H x is a clique while H Y is a collection of singletons.) 
It appears difficult to determine the null distribution of 
cs(H x ) analytically. Thus, we developed two sampling- 
based approaches, discussed in "Methods", to compute 
the statistical significance of cs(Hf) empirically. 

Methods 

As before, let G = (U, I) be a network whose nodes are 
the universal set of genes and let H - (C, J) be a net- 
work whose nodes are the interesting collection of 
genes. We sometimes refer to G and H as the universal 
network and the interesting network, respectively. Let Hf 
be the subgraph of H induced by the nodes annotated 
with function / 

First, we elaborate on our method for comparing dif- 
ferent values of csQ, i.e., sorted lists of network compo- 
nent sizes. Next, with a method to compare 
subnetworks in hand, we proceed to compute the statis- 
tical significance of cs(Hf) empirically using two sam- 
pling-based approaches. 

Comparing sequences of component sizes 

Let A = <2 2 >—> dm) and B = {b x , b 2 , b n } be sorted 
sequences such that a t > a M for 1 < i <m, and b t > b i+ i 
for 1 < i <n. Since the a/s and b/s represent component 
sizes, we assume all values are positive. If m <n, we pad 
A with zeros by setting a t = 0 for m <i < n. We pad B 
similarly in the case that n <m. 

Now, we naturally define A = B if and only if a t - b t 
for all 1 < i < max(m, n). Otherwise, we define A <B if 
and only if there exists some index i, 1 < i < max(m, n), 



such that di <b h and <z ; - = bj for all / <i. If neither of 
these cases hold, we say that B <A. Essentially, we walk 
along A and B simultaneously (which have the same 
length after padding) until we find an index i where a t * 
b t . The smaller sequence is the one that contains the 
smaller of those two values. 

Function randomization 

The function randomization approach for computing 
the statistical significance of cs(Hf) parallels the sam- 
pling-based alternative to the analytical solution for the 
one-sided version of Fishers exact test (Equation 1). In 
the sampling-based solution for Fishers exact test (that 
is not network-based), we repeatedly select a set X of Uf 
genes uniformly at random from the universe of genes 
U. We then calculate an empirical p-value for the func- 
tion f, which represents the statistical siginificance of 
the size of Cf as the fraction of samples for which \X n 
C| > Cf i.e., the size of the intersection between the ran- 
domly selected set X and the interesting collection of 
genes is at least as large as Cf If we repeat this process 
many times, the empirical value converges to the analy- 
tical value computed by Equation 1. 

In the case of network-based enrichment, we have not 
been able to derive the null distribution of cs(H x ) analy- 
tically. Therefore, we apply a sampling-based algorithm 
similar to the one used in the non-network case. We 
repeatedly select a set X of Uf genes uniformly at ran- 
dom from the universal genes U. At each iteration we 
compute the subgraph of H = (C, J) induced by genes in 
X n C and call this subgraph H x < We calculate the p- 
value for /as the fraction of random choices of X for 
which cs(H x ) > cs(Hf). This p-value is an empirical esti- 
mate of the probability that the intersection between the 
interesting network and a randomly-selected subgraph 
of the universal network is more well-connected than 

This network-based formulation of functional enrich- 
ment strongly parallels the traditional non-network for- 
mulation. In fact, notice that if we remove all edges 
from G and from H, the network-based solution is 
exactly the same as the one-sided version of Fisher's 
exact test. In this sense, we generalize a standard func- 
tional enrichment approach for gene sets to networks 
induced by sets of genes. 

Network structure randomization 

The function randomization approach for computing 
network-based functional enrichment tests the depen- 
dency of cs(Hf) on the size of the set of genes annotated 
with / However, Hf also depends on the specific interac- 
tions present in the underlying universal network G - 
(U, 7). To test the dependence of cs(Hj) on G, we devel- 
oped a separate method that relies on randomizing the 
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structure of G. We use the following algorithm [17] to 
generate a randomized universal network with the same 
degree sequence as G, i.e., the degree of each node is 
the same in G and in the new network. This algorithm 
preserves topological properties of molecular interaction 
networks, such as their scale-free degree distribution. 

RANDOMIZER = (U, I), k) : 

for 1 < x < fe 1 1 1 : 

Randomly select two edges {a, b), [c, d) e Y. 

If a = d or b = c, do nothing. 

If [a, d) £ Y or {b, c) e Y, do nothing. 

Remove [a, b) and (c, d) from Y. 

Add {a, d) and (c, b) to Y. 
return G' = {U,Y) 

The first i/ statement in RANDOMIZE prevents the 
algorithm from producing self-loops in the random net- 
work. The second statement ensures that the edges to 
be added do not already exist in F, thereby keeping the 
resulting graph simple. We perform the edge swap ran- 
domization for k\I\ iterations, where k is an external 
parameter to this algorithm. We set k = 10 for all analy- 
sis presented in this paper. We determined this value 
for k by running RANDOMIZE with multiple values of 
k on the networks discussed in "Results". For each value 
of k, we analyzed the distribution of the overlap between 
the edges in a random network and the original net- 
work. Increasing k beyond 10 did not show any signifi- 
cant change in this distribution (compared to k - 10) 
for any of the original networks. 

In order to assess the significance of cs(Hf) for each 
function, we repeatedly generate a randomized universal 
network G' = RANDOMIZER, k). Let H' = (C, /') be 
the subgraph induced in G' by the genes in C, and let 
H'f be the subgraph of H' induced by the genes anno- 
tated with function f; H' and H j are subgraphs of G' in 
the same fashion as H and Hf are subgraphs of G. We 
calculate the ^-value for each function /as the fraction 
of iterations (i.e., random networks) for which 
cs(H^) > cs(H f) , This p-vdlue indicates the probability 
that the genes in Hf are more connected after randomly 
shuffling the edges in G (including those in Hf) while 
maintaining the degree sequence of G. 

Combining p-values 

We developed the function randomization approach 
because of its strong ties to the standard way we inter- 
pret functional enrichment. Furthermore, it can be 
expressed as an extension of the one-sided version of 
Fisher's exact test. This approach is an attempt to 



answer the following question: "Given a function /that 
annotates Uf genes, what is the probability that a ran- 
domly-selected set of Uf genes will have a more con- 
nected intersection with the interesting genes than Hf 
does?". 

As an alternative, we developed the network structure 
randomization approach because we were interested in a 
strictly network-based formulation of functional enrich- 
ment. This approach is an attempt to answer a very dif- 
ferent question from the previous method: "Given a 
function/, what is the probability that the subgraph 
induced by genes in Cf will be more connected in a net- 
work chosen uniformly at random from the set of all 
networks with the same degree sequence as G?". 

In this paper, we only consider those functions 
deemed well-connected by both approaches. Thus, we 
assign a ^-value for each function that is the maximum 
of the two ^-values from each of the two approaches. In 
the rest of the paper, unless stated otherwise, the p- 
value for a function refers to the maximum of the p- 
values computed by both approaches. 

Results 

We present applications of our methods to three differ- 
ent interaction networks, each for a different organism. 
In the first application, we identify biological processes 
from the Gene Ontology (GO) [18] with significant net- 
work-based enrichment in the human B cell interactome 
[19]. In the second application, we discuss gene sets 
from the Molecular Signatures Database [20] that are 
deemed significantly enriched in a response network 
that represents the differences between collagen sand- 
wiches and hepatocyte monolayers, two widely-used sys- 
tems to culture rat hepatocytes [21]. In the third 
application, we discover GO biological processes whose 
network-based enrichment improves when we add a col- 
lection of genetic interactions in S. cerevisiae that were 
identified in a large-scale study conducted by Krogan et 
al [22] to the BioGRID network [23]. 

Each of these three applications of network-based 
enrichment showcases how our methods can be used in 
a different manner: /) determine which functions are 
enriched in a given network, ii) given a network and an 
"interesting" sub-network of genes within that network, 
determine which functions are enriched in the sub-net- 
work, and Hi) given two networks, determine the func- 
tions whose connectivity improves by merging the 
second network into the first. These three applications 
ask different questions about the functional enrichment 
of an interaction network, each of which we can answer 
using the framework presented in "Methods". 

We used the Synergizer [24] to generate all gene and 
protein identifier mappings. We visualized all networks 
using Cytoscape [25]. 
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B cell interactome 

To analyze functions enriched in human B cells, we 
applied our methods to the B cell interactome (BCI) 
[19], which incorporates three different types of interac- 
tions: protein-protein, protein-DNA, and transcription 
factor-modulator. The BCI includes both experimentally 
verified and computationally predicted interactions 
[12,13] in human B cells. Protein-DNA and transcrip- 
tion factor-modular interactions are directed. For the 
purposes of this work, we treated them as being undir- 
ected. We removed repeated edges and self-loops from 
the BCI, resulting in an interaction network with 5748 
nodes and 64,007 edges. We applied our network-based 
enrichment methods to identify GO biological processes 
enriched in the BCI. After downloading functional anno- 
tations from the GO website [18], we applied the true- 
path rule to the annotations, i.e., if a gene was annotated 
with function / we ensured that it was also annotated 
with any ancestors of/ in the GO directed acyclic graph. 

Recall that our enrichment methods require a univer- 
sal network G and an interesting network H. Since we 
were simply interested in the set of functions enriched 
in the BCI, we let both the universal network and the 
interesting network be the BCI. Notice that the standard 
formulation of gene set enrichment (Equation 1) will 
assign a /rvalue of 1 to every function, since the size of 
the interesting set of genes and the size of the universe 
are equal (i.e., u = c and Uf = cy). As a result, we cannot 
compare our method for discovering enriched functions 
to the set-based Fishers exact test for this application. 
However, our approach assigns non-trivial ^-values to 
several relevant functions, as we show below. 

We computed ^-values by executing each random 
sampling approach 100,000 times. We retained those 
functions that annotated at least five genes in the uni- 
versal network and no more than 100 genes in the inter- 
esting network, resulting in a collection of 2098 GO 
biological processes. Of these, 31 processes received a 
network-based enrichment ^-value of less than 10~ 5 , 
which is the smallest empirical ^-value possible over 
100,000 iterations. Next, we present three significantly- 
enriched functions and discuss their relationship to the 
B cell interactome. The goal of this analysis is to 
demonstrate that network-based enrichment methods 
were capable of finding meaningful functions in a net- 
work associated with a specific cellular context. Figure 2 
illustrates the subnetwork induced by the genes anno- 
tated with each of the three functions. In other words, 
each of the networks in Figure 2 is a visualization of the 
subgraph Hf corresponding to one of three discussed 
functions. These figures only retain nodes that are inci- 
dent on at least one edge. While we do not incorporate 
the directionality of the protein-DNA and transcription 
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Figure 2 Functions related to B cells. Subgraphs of the BCI 
induced by genes annotated with (Top) GO:0065004 Protein-DNA 
Complex Assembly, (Middle) GO:0051251 Positive Regulation of 
Lymphocyte Activation, and (Bottom) GO:0006006 Glucose 
Metabolic Process. The red, blue, and green edges represent 
protein-protein, protein-DNA, and transcription factor-modulator 
interactions, respectively. 
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factor-modulator interactions in our methods, we con- 
sidered the directionality when visualizing the results. 
We discuss functions that were ranked 1, 32, and 43 
overall Note that 31 processes received the smallest p- 
values. Hence, all of them have the rank of 1. We 
observed that a majority of these functions were related 
to protein phosphorylation and kinase signaling cas- 
cades. The term we discuss next is one that we could 
relate to B-cells based on the proteins annotated by that 
term. 

Protein-DNA complex assembly 

The GO term "Protein-DNA Complex Assembly" 
(GO:0065004) was enriched in the BCI with a ^-value of 
less than 10~ 5 (rank 1 out of 2098 functions). Figure 2 
demonstrates that the genes annotated by this function 
form several densely-connected subgraphs. A clustering 
algorithm may have detected each individual subgraph 
but may not have included all of them in a single clus- 
ter. Thus, the term "Protein-DNA Complex Assembly" 
may be determined to be enriched (using set-based 
approaches) in the results of a clustering algorithm only 
if the clusters were themselves further grouped during a 
post-processing procedure. 

The genes in the subgraph induced by this function 
are associated with various aspects of the aggregation 
and binding of proteins with DNA molecules to form a 
protein-DNA complex. For example, the nine HIST and 
four H2A proteins are grouped together. DNA wraps 
around the histones, forming higher-order complex pro- 
tein-DNA subunits. Many eukaryotic genes contain a 
DNA signature in their promoter region known as the 
TATA box. These genes rely on the ordered assembly 
of RNA polymerase II and several initiation factors at 
the TATA box. The ten POLR genes that form a clique 
encode different subunits of RNA polymerase II. The 
eight densely-connected general transcription factors 
(GTF2 genes in Figure 2) serve as initiation factors for 
RNA polymerase II. The eleven neighboring TAF genes 
perform a similar role. This function was likely deter- 
mined to be enriched because of the hub transcription 
factor MYC, which serves as a bridge between these 
densely-connected sub-networks. Since MYC is respon- 
sible for regulating many other genes, irregular expres- 
sion of MYC has been linked to several cancers, and 
some studies have identified this gene as a potential 
cancer drug target in humans [26]. 
Lymphocyte activation 

We found the term "Positive Regulation of Lymphocyte 
Activation" (GO:0051251) enriched in the BCI with a p- 
value of 10~ 5 (rank 33). We expected to discover this 
function as enriched in the BCI: since B cells are a spe- 
cific type of lymphocyte, up-regulation of lymphocyte 
activation is an inherent property of genes in the BCI. 
Our methods were capable of determining not only that 



many genes related to lymphocyte regulation appear in 
the BCI, but that there was a rich interconnectivity 
among these genes. Figure 2 illustrates that the signal 
transducer and activator of transcription (STAT) pro- 
teins (STAT5A, STAT5B, STAT6) are central to this 
connectivity. STAT5 has been shown to play a key role 
in the development and proliferation of B cells [27,28] 
through its activation by and mediation of many inter- 
leukins (e.g., IL2, IL6, and IL7, which are also present in 
this network). Each of these interleukins serves as a 
growth factor for various B cell lineages. 
Glucose metabolism 

We found the term "Glucose Metabolic Process" 
(GO:0006006) enriched in the BCI with a ^-value of 2 x 
10 -5 (rank 43). While glucose metabolism is not solely 
related to the behavior of B cells, this process is impor- 
tant for many cell types, including B cells, as a primary 
source of energy. Perhaps what was most interesting 
about this function was its reliance on MYC for most of 
its connectivity. As mentioned, MYC is considered to be 
a master regulator because it regulates the activity of a 
large number of human genes. Figure 2 demonstrates 
that without the annotation of MYC by "Glucose Meta- 
bolic Process" the network would consist of mostly dis- 
connected proteins. While MYC also plays a central role 
in "Protein-DNA Complex Assembly", one could argue 
that, even without MYC and the edges incident on it, 
the proten-DNA complex assembly network is reason- 
ably well-connected and will be deemed to be signifi- 
cantly enriched by our methods. In the case of "Glucose 
Metabolic Process," we elucidated a function whose net- 
work-based enrichment score relies heavily on its con- 
nectivity through a central hub. 

Hepatic cultures 

The liver carries out a multitude of necessary functions 
in humans and many other animals, including the meta- 
bolism of foreign compounds (xenobiotics) and choles- 
terol. Hepatocytes constitute roughly 70-80% of the liver 
cells. Two commonly used systems for culturing these 
cells in vitro are the hepatocyte monolayer (HM) and 
the collagen sandwich (CS). Briefly, the HM consists of 
a layer of collagen on top of which a single layer of 
hepatocytes are placed; the CS is similar to the HM 
with the addition of an extra layer of collagen on top 
(creating a "sandwich" of hepatocytes between two 
layers of collagen). A recent study analyzed the expres- 
sion profile over eight days of genes in primary rat 
hepatocytes in both HM and CS tissue cultures [21]. 
The authors demonstrated that over eight days, the 
genes annotated by a wide range of liver-specific biologi- 
cal processes were consistently up-regulated in hepato- 
cytes in CS but not in HM and that processes related to 
the cell cycle were down-regulated in CS, as compared 
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to HM. Hepatocytes in HM differentiate and lose some 
of their morphology over time, contributing to their loss 
of liver-specific functions. 

For this application of our enrichment methods, we 
sought to discover functions that annotate differentially- 
expressed genes in the CS cultures when compared to 
HMs. We used the STRING database of known and pre- 
dicted protein interations [29] as our universal interac- 
tion network. STRING assigns a weight between 0 and 
1000 to each interaction based on the combined scores 
of various sources of evidence for the interaction, where 
higher scores indicate more confidence for that interac- 
tion. We removed any self-loops or interactions with a 
score below 500 from the universal network, resulting in 
a network of 9925 nodes and 204,992 edges. We ana- 
lyzed the expression profiles for rat hepatocyte genes in 
HM versus CS cultures after 8 days of growth [21], 
since CS showed the greatest divergence from HM after 
8 days. Based on these expression profiles, we used 
LIMMA [30] to compute ^-values that represented the 
significance of the differential expression of each gene 
between CS and HM. We desired to compute the 
response network (a subgraph of the STRING network) 
that captured the differential expression between CS 
and HM cultures. Accordingly, we used these per-gene 
^-values in concert with the BioNet algorithm [31] to 
compute a response network (within the STRING uni- 
versal network) of interconnected genes that is per- 
turbed in CS, in comparison to HM. We used the 
default parameters for BioNet, with FDR=0.001. Please 
see the BioNet publication for more details. The 
response network contained 876 nodes and 3423 edges. 
We collected functional annotations of the genes in this 
network from the Molecular Signatures Database 
(MSigDB) [20]. We only retained curated gene sets 
(category C2) and GO gene sets (category C5) from 
MSigDB, and we removed any functions that annotated 
fewer than five genes in the universal interaction 
network. 

With the universal interaction network, an interesting 
subnetwork of the universe (i.e., the BioNet response 
network), and functional annotations of the universal 
genes, we applied our network-based enrichment meth- 
ods. We tested the enrichment of 4035 MSigDB func- 
tions for 100,000 iterations. We also computed the 
enrichment of each function, using an empirical version 
of Equation 1, to allow us to compare our network- 
based enrichment p-vdlue for a given function to the p- 
value given by a standard gene set-based enrichment 
method (i.e., the one-sided version of Fisher's exact 
test). We did not use the analytical version of Fisher's 
test so that the enrichment ^-value estimated by our 
methods for each function was comparable to that com- 
puted by Fishers test. We observed that 177 of the 4035 



functions received a network-based enrichment j?-value 
of less than 10 -5 , the smallest possible empirical j?-value 
for 100,000 iterations (i.e., our methods returned a p- 
value of 0 for these functions). We present four of the 
top 177 functions and analyze their relationship to liver 
cultures. Figure 3 illustrates the networks induced by 
the genes annotated by these four top-ranking network- 
based enriched functions (i.e., we show Hf for each func- 
tion). The genes are colored based on their level of per- 
turbation in the contrast between CS and HM. Lightly- 
colored genes indicate little or no perturbation, while 
dark red (green) nodes denote significant up- (down-) 
regulation in CS compared to HM. 
Actin cytoskeleton 

Our network-based enrichment methods discovered the 
KEGG [3] pathway "Regulation of Actin Cytoskeleton" 
enriched with a ^-value less than 10~ 5 (rank 1). Com- 
paratively, the empirical p-vdlue for the one-sided ver- 
sion of Fishers exact test was 0.02857 (rank 819), more 
than three orders of magnitude larger than our j?-value. 
Actin organizes into thin filaments to provide structure 
to cells. Cell locomotion is often a driving force for the 
development of actin cytoskeleton. Actin filaments form 
a highly-organized, complex structure, and manipulation 
of the actin cytoskeleton enables adhesion to the sub- 
trate and movement of the cell [32]. However, hepato- 
cytes are generally stationary cells and are unlikely to 
use this mechanism. More specifically, HM hepatocytes 
lose the true hepatic phenotypes much faster than those 
in CS cultures. A specific characteristic of this loss is 
the production of actin fibers that help cells in the HM 
culture to better adhere to the underlying substrate. 
This phenomenon explains why many genes in the 
"Regulation of Actin Cytoskeleton" network (Figure 3) 
are consistently down- regulated in CS hepatocytes com- 
pared to those in HM. Note that this network contains 
some up-regulated genes, including Fgf21 and Myhl4. 
These genes are involved in a wide variety of biological 
processes including cell growth, tissue repair, and cell 
polarity. Their participation in several processes not 
directly related to actin development may explain why 
they are upregulated in CS compared to HM in this 
network. 

Cell cycle checkpoints 

We found the REACTOME [1] pathway "Cell Cycle 
Checkpoints" enriched with a p-value less than 10~ 5 
(rank 1), while the empirical version of Fishers method 
assigned a p-value of 0.05539 (rank 1034). Many genes 
in this network are essential to the cell cycle process. 
ANAPC2 codes for part of the anaphase-promoting 
complex, which is responsible for promoting eukaryotic 
cells from metaphase to anaphase during mitosis. Weel 
serves as a key checkpoint in the cell cycle by inhibiting 
entry into mitosis until the cell has grown to a certain 
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size, thus preventing the separation of the cell into two 
daughter cells that are too small. Mcm6 and Mcm7 are 
essential genes for DNA replication in eukaryotes, a sig- 
nificant process in the cell cycle. Under healthy condi- 
tions, mature hepatocytes do not readily divide except 
to replace damaged hepatocytes [33]. This observation 
may explain the downregulation of these genes in CS 
compared to HM. In other words, down-regulation of 
cell cycle checkpoints prevents the cell cycle from pro- 
gressing in CS cultures, in comparison to HM cultures. 
Nuclear receptor transcription pathway 
We identified the REACTOME "Nuclear Receptor Tran- 
scription Pathway" enriched with a network-based p- 
value less than 10 -5 (rank 1). Fisher's method returned a 
^-value of 0.0092 (rank 565). Nuclear receptors are 
responsible for sensing steroids, hormones, and other 
signaling molecules in the cell. Since one of the primary 
functions of the liver is the production of hormones, 
enrichment of this gene set in CS cultures is to be 
expected and was noted in an earlier study [21]. 
Furthermore, we expect many genes involved in this 
pathway to exhibit up-regulation in CS versus HM cul- 
tures. Figure 3 confirms that all the genes in the inter- 
section between this pathway and the BioNet response 
network are indeed up-regulated. Nrlh3, Nrli2, Rxrg, 



and Hnf4a encode parts of the liver X receptor, the 
pregnane X receptor, the retinoic acid receptor, and the 
hepatocyte nuclear factor, respectively. All these nuclear 
receptors are responsible for sensing signaling molecules 
that are highly relevant to liver function, e.g., metabo- 
lism of toxic substances and vitamins. 
Drug metabolism 

The KEGG pathway "Drug Metabolism of Cytochrome 
P450" was enriched according to our network-based 
method with a ^-value of less than 10~ 5 (rank 1). Fisher's 
method assigned the same function a ^-value of 0.01572 
(rank 681). All genes in this network were up-regulated 
in CS hepatocytes compared to those in HM. One of the 
primary functions of the liver is processing xenobiotics, 
including drugs. Cytochrome P450s represent a class of 
proteins that are responsible for breaking down various 
lipids, steroids, and xenobiotic (external) compounds. 
Primary hepatocytes actively express these proteins, even 
in the absence of external chemicals or drugs [34]. The 
enrichment of this pathway in CS cultures (in compari- 
son to HMs) supports the well-known phenomenon that 
HMs rapidly lose liver-specific functions. 
Gene sets with high network-based p-values 
Several functions received a high network-based p-value 
from our methods and a low set-based j?-value using 
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Fishers exact test. We ranked the genes sets in increas- 
ing order of their set-based ^-values, and we report the 
two highest-ranking gene sets that received a network- 
based p-value of 1. According to Fisher's exact test, we 
found the MSigDB gene set "Su Liver" significantly 
enriched with a j?-value of 1.77 -6 (rank 89) and the gene 
set "Chiaradonna Neoplastic Transformation Kras DN" 
significantly enriched with a p-v&lue of 2.48 x 10~ 6 
(rank 93). Our network-based functional enrichment 
approach deemed both functions insignificant with a p- 
value of 1 because there exist no interactions among the 
genes annotated by either function. Gene sets whose 
constituent proteins are sparsely-connected may be diffi- 
cult to interpret. 

Improving functional network coherence 

In the third and final application of our network-based 
enrichment methods, we sought to analyze how the 
addition of a collection of experimentally determined 
interactions impacts a dataset of known interactions. 
Specifically, given a universal interaction network in S. 
cerevisiae, we wanted to determine which biological 
functions became more coherent within the network 
after introducing new pairwise interactions discovered in 
a large-scale genetic interaction (GI) study. We retrieved 
the yeast interactome from the BioGRID database [23], 
which incorporates both genetic and physical interac- 
tions from multiple sources (5681 nodes and 97,862 
edges). BioGRID includes a sub-network of 14,421 GIs 
among 721 genes reported by Collins et al [22]. We 
refer to this list of GIs as "the GI study" or "the GIs" 
throughout this section. Our goal was to ask which GO 
biological processes became better connected when the 
GIs were added to BioGRID. We downloaded GO biolo- 
gical process annotations of the genes in BioGRID from 
GO [18]. As with the annotations of the BCI, we applied 
the true-path rule to the annotations, and we removed 
any functions that annotated fewer than five genes in 
the universal network and more than 100 genes in the 
interesting collection C. 

We then applied our network-based enrichment 
approach in two ways. The universal interaction net- 
work (G) remained the same in both cases. In the first 
case, we set the interesting network (H) to be the same 
as the universal network (which also includes edges 
from the GI study). This analysis is similar to the analy- 
sis performed on the BCI. Essentially, the network-based 
^-values in this case indicated which functions were 
enriched in the entire BioGRID network, including the 
GIs. For the second case, we set the interesting network 
to be the universal network with interactions from the 
GI study removed. If there were other sources of evi- 
dence for one of the GIs, we retained that edge in the 
interesting network. The network-based ^-values in this 



case indicated which functions were enriched in the uni- 
versal network if we ignored any information from the 
GI study. 

Let p a (f) be the network-based p- value for function/ 
from the first case, where we considered all edges in the 
univeral network, including those from the GI study. 
Similarly, let p r (f) be the network-based ^-value for 
function /from the second case, where we removed any 
edges in the GI study from the interesting network. We 
scored each function by s(f) - log 10 (^ r (/)/^(/)). Thus, 
functions whose network-based enrichment j?-value 
decreased (became more significant) as a result of add- 
ing the GIs received a positive score, functions whose 
enrichment remained the same received a score of 0, 
and functions whose ^-values increased (became less 
significant) as a result of adding the GIs received a 
negative score. We believed that highly positive-scoring 
functions were those biological processes whose coher- 
ence improved the most by adding the GIs. Of the 1443 
functions tested, 28 functions received a score greater 
than 1 (i.e., the ^-value decreased by more than 1 order 
of magnitude after adding the GIs), and only 1 function 
received a score less than -1 (i.e., the ^-value increased 
by more than 1 order of magnitude by including the 
GIs). Next, we discuss the three GO biological pro- 
cesses, "Chronological Cell Aging", "DNA Geometric 
Change", and "Histone Deacetylation", that received the 
2nd, 3rd, and 9th highest overall scores, respectively. 
These examples illustrate a new application of network- 
based functional enrichment that is sensitive to the 
increased connectivity among a set of genes upon the 
addition of multiple edges. However, gene set-based 
enrichment may not be sensitive to such an addition of 
edges if, for example, all newly-added edges were among 
genes already present in the network. 
Cell aging 

The GO biological process "Chronological Cell Aging" 
(GO:0001300) received a score of 2.6464 (p a = 0.0003 
and p r = 0.1329) exhibiting a decrease in network- 
based ^-value of nearly three orders of magnitude 
with the addition of the GIs. This process is the pro- 
gression of quiescent (non-dividing) cells from their 
inception to the end of their lifespans. Figure 4 
demonstrates that this increase in enrichment is a 
result of three GIs incident on SOD1, a Cu-Zn super- 
oxide dismutase that has a role in the detoxification 
of oxygen radicals. These enzymes catalyze the break- 
down of the superoxide radical (02-) into an oxygen 
molecule and hydrogen peroxide. Yeast cultures lack- 
ing SOD1 exhibit drastically decreased viability [35]. 
Strikingly, the interactions from the GI study were 
able to connect three previously disconnected subnet- 
works of genes annotated with chronological cell 
aging in S. Cerevisiae, 
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DNA geometric change 

Our approach identified the GO term "DNA Geometric 
Change" (GO:0032392) with a score of 2.32 as having a 
considerably decreased network-based enrichment Rvalue 
after adding the GIs (p a less than 10~ 4 , but p r = 0.0021). 
The ^-value decreased primarily because of the Gl between 
SSL1 and TOPI, which connected two previously discon- 
nected components of the network induced by genes asso- 
ciated with DNA geometric change. Additionally, the 
added GIs reinforced the connectivity among RFA1, RFA2, 
RAD54, SRS2, and their neighboring genes. Thus, the con- 
nected component induced by these genes was less likely to 
become disconnected during the randomization procedures 
used to perform network-based enrichment. 
Histone deacetylation 

Our enrichment approach assigned a score of 1.5955 (p a 
= 0.0005 and p r = 0.0197) to the GO term "Histone 
Deacetylation" (GO:0016575). The GIs improved the 
connectivity among several genes related to histone dea- 
cetylation (HDA1, HOS2, SDS3, SIF2, SNT1, UME1, 
and UME6). The GIs also provided a new source of evi- 
dence for many previously-known interactions, a prop- 
erty we can deduce from Figure 4 but one that is not 
explicitly considered by our methods. However, the 
interaction connecting HOS3 to EAF3 and the interac- 
tions between HST3 and CPR1/SIN3 are perhaps the 
most interesting as they provide new evidence for 
including HOS3, HST3, and HST4 in the large con- 
nected component associated with histone deacetylation. 
The GIs that connect HST3 and HST4 to the largest 
connected component of this biological process are 
quite striking, since these genes code for histone deace- 
tylases, and the GIs provide previously unknown inter- 
action data for including them in this network. 

Conclusions 

In this paper, we have presented a novel approach to 
functional enrichment that takes into account the 



pairwise relationships among genes annotated by a par- 
ticular function. We proposed two different methods for 
calculating ^-values to assess the significance of each 
function in a network-based context; we described these 
methods as "function randomization" and "network 
structure randomization". We required functions to 
have low enrichment ^-values based on both criteria. 
Our function randomization approach is a generalization 
of the one-sided version of Fisher's exact test, a standard 
formulation of functional enrichment for gene sets. Spe- 
cifically, after removing edges from the universal and 
interesting networks, the function randomization 
approach is equivalent to the one-sided version of Fish- 
er's exact test. Our network structure randomization 
approach offers a strictly network-centric method for 
determining functional enrichment. 

We utilized our methods on real biological data from 
three different organisms: human, rat, and yeast. In each 
organism, we showcased a different application of net- 
work-based enrichment. First, we used the human B cell 
interactome to demonstrate the capability of our meth- 
ods to simply discover enriched functions in a single 
network. We have noted that many standard gene set 
enrichment methods do not address this issue, as they 
require a universal gene set and an interesting collection 
of genes within the universal genes. Moreover, graph 
clustering algorithms that are designed to compute 
dense subgraphs might not detect sub-networks that are 
well-connected but not very dense, such as those corre- 
sponding to "Protein-DNA Complex Assembly" and 
"Glucose Metabolic Process" in Figure 2. Second, we 
applied our methods to a response network composed 
of rat genes perturbed in hepatocytes cultured in two 
different ways. This approach parallels traditional gene 
set enrichment methods, where given an interesting col- 
lection of genes, we seek functions overrepresented in 
the interesting genes with respect to some universe of 
genes. We identify several relevant functions that our 
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network-based method finds as being most significant 
but are far from the highest-ranking functions using the 
set-based Fisher's method. Third, we demonstrated a 
novel application of network-based enrichment to assess 
the functional contribution of a collection of genetic 
interactions in yeast to the underlying universe of 
known yeast interactions. 

Our work suggests many directions of future research. 
While we did not take edge reliabilities into account for 
the analysis presented here, our methods can be readily 
applied to weighted networks. Indeed, our approach for 
comparing sequences of component sizes can be used to 
compare sorted sequences of the sums of edge weights 
within each component (rather than the number of 
nodes). Identifying alternative methods for sequence com- 
parison is also an interesting question, particularly those 
methods for which the resulting distribution of sorted 
sequences can be determined analytically. Such compari- 
son methods may drastically improve the computational 
efficiency of our network-based enrichment approach. 
Our methods analyze the functions one at a time and 
often lead to redundant functions being identified as 
enriched. Additionally, we need to apply corrections for 
multiple hypotheses testing, thereby potentially decreasing 
our statistical power. Furthermore, our permutation-based 
approach for computing ^-values is very time-consuming. 
These three issues can potentially be tackled by generaliz- 
ing model-based approaches for gene function enrichment 
[15,16] to the domain of network-based function enrich- 
ment. After assuming an appropriate model for how biolo- 
gical processes may be perturbed in a cell, these methods 
proceed to infer the set of biological processes that best 
explain an input list of genes (the interesting collection) 
against a background list (the universe). We are develop- 
ing a modified version of our network-based approach 
that identifies groups of functions that jointly cover the 
network of interest, rather than consider functions indivi- 
dually. Applying these methods in the network context is 
an interesting and important problem. 
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